!pip install swifter
import pandas as pd
import numpy as np
import swifter
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
%matplotlib inline
import seaborn as sns
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
data = pd.read_csv('datatest.txt').reset_index(drop = True)
data.head(5)
| | date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|---|
| 0 | 2015-02-02 14:19:00 | 23.7000 | 26.272 | 585.200000 | 749.200000 | 0.004764 | 1 |
| 1 | 2015-02-02 14:19:59 | 23.7180 | 26.290 | 578.400000 | 760.400000 | 0.004773 | 1 |
| 2 | 2015-02-02 14:21:00 | 23.7300 | 26.230 | 572.666667 | 769.666667 | 0.004765 | 1 |
| 3 | 2015-02-02 14:22:00 | 23.7225 | 26.125 | 493.750000 | 774.750000 | 0.004744 | 1 |
| 4 | 2015-02-02 14:23:00 | 23.7540 | 26.200 | 488.600000 | 779.000000 | 0.004767 | 1 |
data.index = pd.to_datetime(data['date'])
pd.DataFrame(data.resample('H').agg({'Temperature':'mean',
                                     'Humidity':'mean',
                                     'Light':'last',
                                     'CO2':'last',
                                     'HumidityRatio':'mean',
                                     'Occupancy':'mean'})).head(5)
| date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|
| 2015-02-02 14:00:00 | 23.657118 | 27.023720 | 470.333333 | 1024.666667 | 0.004889 | 1.000000 |
| 2015-02-02 15:00:00 | 23.293950 | 28.412430 | 429.000000 | 1059.600000 | 0.005030 | 1.000000 |
| 2015-02-02 16:00:00 | 22.773142 | 26.737452 | 428.200000 | 847.400000 | 0.004585 | 1.000000 |
| 2015-02-02 17:00:00 | 22.534520 | 24.972128 | 419.000000 | 773.600000 | 0.004217 | 0.610169 |
| 2015-02-02 18:00:00 | 21.993372 | 24.595967 | 0.000000 | 638.000000 | 0.004018 | 0.083333 |
Resampling also makes missing records explicit: an hour with no readings becomes an empty row. It is important to account for those records, since you might want to put in 0 values where there were no records, or use the previous or next time step for imputation. I removed the records for hour 15 to show how the hour 14 values can be used to impute the missing hour:
data = pd.read_csv('datatest.txt').reset_index(drop = True)
data_missing_records = data[~(pd.to_datetime(data.date).dt.hour == 15)].reset_index(drop = True)
data_missing_records.index = pd.to_datetime(data_missing_records['date'])
# Hour 15 has no rows, so its resampled bin is empty (NaN); ffill() copies the hour 14 values forward.
data_missing_records.resample('H').agg({'Temperature':'mean',
                                        'Humidity':'mean',
                                        'Light':'last',
                                        'CO2':'last',
                                        'HumidityRatio' : 'mean',
                                        'Occupancy' : 'mean'}).ffill().head(5)
| date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|
| 2015-02-02 14:00:00 | 23.657118 | 27.023720 | 470.333333 | 1024.666667 | 0.004889 | 1.000000 |
| 2015-02-02 15:00:00 | 23.657118 | 27.023720 | 470.333333 | 1024.666667 | 0.004889 | 1.000000 |
| 2015-02-02 16:00:00 | 22.773142 | 26.737452 | 428.200000 | 847.400000 | 0.004585 | 1.000000 |
| 2015-02-02 17:00:00 | 22.534520 | 24.972128 | 419.000000 | 773.600000 | 0.004217 | 0.610169 |
| 2015-02-02 18:00:00 | 21.993372 | 24.595967 | 0.000000 | 638.000000 | 0.004018 | 0.083333 |
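The imputation choices mentioned above (zeros, the previous time step, or the next time step) can be compared side by side on a toy series. This is a minimal sketch on made-up readings, not the occupancy data; the `toy` frame, its values, and the `"60min"` rule are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up readings with no record during hour 15.
idx = pd.to_datetime(["2015-02-02 14:10", "2015-02-02 14:40",
                      "2015-02-02 16:05", "2015-02-02 17:30"])
toy = pd.DataFrame({"Temperature": [23.0, 24.0, 22.0, 21.0]}, index=idx)

# Hourly bins; the 15:00 bin has no rows, so its mean is NaN.
hourly = toy.resample("60min").mean()

filled_prev = hourly.ffill()    # reuse the previous hour (14:00)
filled_next = hourly.bfill()    # reuse the next hour (16:00)
filled_zero = hourly.fillna(0)  # treat "no records" as 0

print(pd.concat({"raw": hourly["Temperature"],
                 "ffill": filled_prev["Temperature"],
                 "bfill": filled_next["Temperature"],
                 "zero": filled_zero["Temperature"]}, axis=1))
```

Which fill is appropriate depends on the column: a forward fill suits slowly changing readings such as temperature, while 0 only makes sense where "no record" really means "nothing happened".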
data['Temp_Bands'] = np.round(data['Temperature'])
fig = px.line(data, x = 'date',
              y = 'HumidityRatio',
              color = 'Temp_Bands',
              title = 'Humidity Ratio across dates as a function of Temperature Bands',
              labels = {'date' : 'Time Stamp',
                        'HumidityRatio' : 'Humidity Ratio',
                        'Temp_Bands' : 'Temperature Band'})
fig.show()
I sometimes run into long wait times when processing pandas columns, even when running the notebook on a large instance. An easy one-word addition can speed up the apply functionality of a pandas DataFrame: one only has to import the swifter library and call apply through the .swifter accessor.
import swifter
def custom(num1, num2):
    if num1 > num2:
        if num1 < 0:
            return "Greater Negative"
        else:
            return "Greater Positive"
    elif num2 > num1:
        if num2 < 0:
            return "Less Negative"
        else:
            return "Less Positive"
    else:
        return "Rare Equal"
data_sample = pd.DataFrame(np.random.randint(-10000, 10000, size = (50000000, 2)), columns = list('XY'))
# created a 50 million rows DataFrame and compared the time taken to process it via swifter apply() vs the vanilla apply().
# I also created a dummy function with simple if else conditions to test the two approaches on.
%%time
results_arr = data_sample.apply(lambda x : custom(x['X'], x['Y']), axis = 1)
Wall time: 13min 38s
%%time
results_arr = data_sample.swifter.apply(lambda x : custom(x['X'], x['Y']), axis = 1)
Wall time: 8min 45s Parser : 181 ms
We are able to reduce the processing time by about 35.8%, from 13 minutes 38 seconds (818 s) to 8 minutes 45 seconds (525 s); the swifter run takes roughly 64% of the original time.
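A possible further step, which is not part of the benchmark above: because `custom` only compares two columns, the same labels can be produced without any row-wise apply by using `np.select`. The sketch below is an assumption-laden illustration (the helper name `custom_vectorized` and the small random frame are hypothetical); it checks on a small sample that the vectorised labels match the row-wise ones.

```python
import numpy as np
import pandas as pd

def custom(num1, num2):
    # Same branching logic as the row-wise function used in the benchmark.
    if num1 > num2:
        return "Greater Negative" if num1 < 0 else "Greater Positive"
    elif num2 > num1:
        return "Less Negative" if num2 < 0 else "Less Positive"
    return "Rare Equal"

def custom_vectorized(df):
    # Hypothetical helper: evaluates the same conditions on whole columns.
    x = df["X"].to_numpy()
    y = df["Y"].to_numpy()
    # np.select takes the first condition that is True for each row,
    # so the order mirrors the if/elif chain in custom().
    conditions = [(x > y) & (x < 0), x > y, (y > x) & (y < 0), y > x]
    choices = ["Greater Negative", "Greater Positive",
               "Less Negative", "Less Positive"]
    labels = np.select(conditions, choices, default="Rare Equal")
    return pd.Series(labels, index=df.index)

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(-5, 5, size=(1000, 2)), columns=list("XY"))

row_wise = small.apply(lambda r: custom(r["X"], r["Y"]), axis=1)
vectorized = custom_vectorized(small)
print((row_wise == vectorized).all())
```

No timing is claimed here; the point is only that the two approaches agree, so the vectorised form can be benchmarked against apply and swifter on the full frame.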
While we are on the topic of reducing runtime, I often end up dealing with datasets that I wish to process at multiple granularities. Using multiprocessing in Python saves that time by spreading the work across multiple workers.
I demonstrate the effectiveness of multiprocessing using the same kind of 50-million-row DataFrame I created above, except this time I add a categorical variable whose value is selected at random from a set of vowels.
import random
string = 'AEIOU'
data_sample = pd.DataFrame(np.random.randint(-10000, 10000, size = (50000000, 2)),
                           columns = list('XY'))
data_sample['random_char'] = random.choices(string,
                                            k = data_sample.shape[0])
unique_char = data_sample['random_char'].unique()
I compared a plain for loop against the ProcessPoolExecutor from concurrent.futures to demonstrate the runtime reduction we can achieve.
%%time
arr = []
for i in range(len(data_sample)):
    num1 = data_sample.X.iloc[i]
    num2 = data_sample.Y.iloc[i]
    if num1 > num2:
        if num1 < 0:
            arr.append("Greater Negative")
        else:
            arr.append("Greater Positive")
    elif num2 > num1:
        if num2 < 0:
            arr.append("Less Negative")
        else:
            arr.append("Less Positive")
    else:
        arr.append("Rare Equal")
Wall time: 23min 46s
def custom_multiprocessing(i):
    # Work on a copy of the rows for one vowel so a new column can be added safely.
    sample = data_sample[data_sample['random_char'] == unique_char[i]].copy()
    arr = []
    for j in range(len(sample)):
        # Read the X and Y values for the current row of this group.
        num1 = sample.X.iloc[j]
        num2 = sample.Y.iloc[j]
        if num1 > num2:
            if num1 < 0:
                arr.append("Greater Negative")
            else:
                arr.append("Greater Positive")
        elif num2 > num1:
            if num2 < 0:
                arr.append("Less Negative")
            else:
                arr.append("Less Positive")
        else:
            arr.append("Rare Equal")
    sample['values'] = arr
    return sample
# function that allows me to process each vowel grouping separately:
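%%time
import concurrent.futures

def main():
    # One task per vowel group; list() waits for every worker and collects the results.
    with concurrent.futures.ProcessPoolExecutor(max_workers = 5) as executor:
        results = list(executor.map(custom_multiprocessing, range(len(unique_char))))
    # Combine the per-group DataFrames back into a single DataFrame.
    return pd.concat(results)

if __name__ == '__main__':
    aggregated = main()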
Wall time: 0 ns
We see a reduction of CPU time by 99.3%. One must remember to use these methods carefully, though: they will not serialize the output, so applying them to independent groups can be a good means to leverage this capability.
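The grouping idea can also be sketched as a standalone script. This is a minimal sketch under stated assumptions, not the benchmark that was timed above: the names `label_chunk` and `run_parallel` and the small demo frame are hypothetical. Each group is passed to the worker explicitly rather than read from a global DataFrame, and the entry point is guarded, which is generally needed on platforms that start workers by spawning a fresh interpreter.

```python
import concurrent.futures

import numpy as np
import pandas as pd

def label_chunk(chunk):
    """Label every row of one group; intended to run inside a worker process."""
    labels = []
    for num1, num2 in zip(chunk["X"].to_numpy(), chunk["Y"].to_numpy()):
        if num1 > num2:
            labels.append("Greater Negative" if num1 < 0 else "Greater Positive")
        elif num2 > num1:
            labels.append("Less Negative" if num2 < 0 else "Less Positive")
        else:
            labels.append("Rare Equal")
    # assign() returns a new DataFrame, leaving the input chunk untouched.
    return chunk.assign(values=labels)

def run_parallel(df, group_col, max_workers=5):
    """Split df by group_col, label each group in its own task, and recombine."""
    chunks = [group for _, group in df.groupby(group_col)]
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        parts = list(executor.map(label_chunk, chunks))
    # Restore the original row order after concatenating the groups.
    return pd.concat(parts).sort_index()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = pd.DataFrame(rng.integers(-100, 100, size=(10_000, 2)), columns=list("XY"))
    demo["random_char"] = rng.choice(list("AEIOU"), size=len(demo))
    labelled = run_parallel(demo, "random_char")
    print(labelled["values"].value_counts())
```

In a notebook, the worker function may need to live in an importable module for the worker processes to find it, so it is worth confirming that the returned DataFrame is actually populated before trusting a timing.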
With the rise of machine learning and deep learning approaches for time series forecasting, it is essential to use a metric that is NOT based only on the distance between predicted and actual values. A metric for a forecasting model should also use errors from the temporal trend to evaluate how well a model is performing, instead of just point-in-time error estimates. Enter Mean Absolute Scaled Error (MASE)! This metric takes into account the error we would get from a random walk approach, where the last timestamp’s value is the forecast for the next timestamp. It compares the error from the model to the error from this naive forecast.
def MASE(y_train, y_test, pred):
    naive_error = np.sum(np.abs(np.diff(y_train)))/(len(y_train)-1)
    model_error = np.mean(np.abs(y_test - pred))
    return model_error/naive_error
If MASE > 1 then the model is performing worse than a random walk. The closer the MASE is to 0, the better the forecasting model.
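A small worked example makes the scaling concrete. The numbers below are made up for illustration: the training series has one-step changes of 1, 2 and 3, so the naive (random walk) error is 2.0, and the hypothetical forecast misses each test value by 1.

```python
import numpy as np

def MASE(y_train, y_test, pred):
    # Mean absolute one-step change in the training data (naive forecast error).
    naive_error = np.sum(np.abs(np.diff(y_train))) / (len(y_train) - 1)
    # Mean absolute error of the model on the test period.
    model_error = np.mean(np.abs(y_test - pred))
    return model_error / naive_error

y_train = np.array([1.0, 2.0, 4.0, 7.0])  # |diffs| = 1, 2, 3 -> naive error 2.0
y_test = np.array([8.0, 10.0])
pred = np.array([9.0, 9.0])               # absolute errors 1, 1 -> model error 1.0

score = MASE(y_train, y_test, pred)
print(score)  # 0.5
```

A score of 0.5 here means the model's average error is half the average step-to-step change seen in the training data.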
!jupyter nbconvert RoomOccupancy.ipynb --to html